Hints in stats
Some hints in stats that are always good to know about.

How to choose a model

In order to choose a model, one should look at:
*size of the training data
*number, sparsity and general distribution shape of the features
*feature and label noise
*correlated and irrelevant features
*potential nonlinearities and feature interactions

Methods such as random forests or SVMs are competitive across a very wide range of problems. Here are some suggestions:
*Type of output:
:a real number: regression
:a binary variable: binary classification
:a categorical variable: multiclass classification
:a set of categorical variables: multi-label prediction
:a sequence: tagging
:a general object: structured prediction
*Dimensionality and sparsity: for high-dimensional sparse data, SVMs, SVM-based methods such as Pegasos, or Perceptron-style online methods such as the passive-aggressive algorithm
*Linear separability: for data that are far from linearly separable, artificial neural networks (ANNs), decision trees or kernel SVMs
*Scalability (how many training instances you have): stochastic and randomized algorithms
*Prediction speed (how many predictions per second you need to make): linear models penalized with a sparsity-inducing norm, such as the L1 norm
*Memory requirement (e.g. on mobile phones): online algorithms

Be careful when testing for normality

When the sample size is small, even big departures from normality often go undetected, and when the sample size is large, even the smallest deviation from normality will lead to rejection of the null hypothesis. So methods for checking normality (such as the Shapiro-Wilk test, the Anderson-Darling test, Q-Q plots, etc.) are not as reliable as we may think, as illustrated on the following link.

Handling missing data

#Simple deletion strategies
#"Working around" strategies: for example, Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model
#Imputation strategies: replacing each missing value with an estimate of the actual value for that case.
Mean imputation replaces the missing value with the mean of the variable in question; Expectation Maximization (EM) arrives at the best point estimates of the true values given the model (which is itself estimated on the basis of the imputed missing values); regression-mean imputation replaces the missing value with the conditional regression mean; and multiple imputation derives several imputed values from a prediction equation rather than a single one.

Sample size

The required sample size depends on the distribution of the test statistic under the null hypothesis and under the alternative hypothesis. As a rule of thumb, if you want your confidence interval to be narrower by a factor of 3, you should increase your sample size by a factor of 3^2 = 9, because the width of the interval shrinks in proportion to the square root of the sample size.

Some designs: compare a subject
1. to himself
* e.g. to baseline
* e.g. to alternating on-off treatments
2. to similar subjects (stratified)
* i.e. stratum based on clinical similarity
3. to those in similar care (unstratified)
* i.e. stratum based on care environment, e.g. hospital, referring clinic, primary care physician
4. Compare one group to another

Ways to increase power
*Size groups equally
*Increase sample size
*Study drug efficacy rather than effectiveness

Efficacy is not the same as effectiveness.[1] A treatment is efficacious if it works under the ideal, controlled conditions of a trial; it is effective if it works in real life in non-ideal circumstances. In real life, medications will be used in doses and frequencies never studied and in patient groups never assessed in the trials. Drugs will be used in combination with other medications that have not been tested for interactions, and by people other than the patient (the ‘over the garden fence’ syndrome). Effectiveness cannot be measured in controlled trials, because the act of inclusion into a study is a distortion of usual practice. Effectiveness can be defined as ‘the extent to which a drug achieves its intended effect in the usual clinical setting’. It can be evaluated through observational studies of real practice.
This allows practice to be assessed in qualitative as well as quantitative terms.

Effects to be taken into account
* Hawthorne effect (subjects improve or modify an aspect of their behavior that is being experimentally measured, simply in response to the fact that they are being studied)
* Selection and self-selection bias
* Placebo effect
* Rosenthal effect (the greater the expectation placed upon people, often children, students or employees, the better they perform)
* Stratum effect
* Cohort and adjunct care effects (particular impact of a group bonded by time or common life experience)
* Treatment compliance
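
Returning to the sample-size rule of thumb given earlier (a confidence interval 3 times narrower needs 3^2 = 9 times as many subjects), the following is a minimal sketch that checks it numerically. It assumes a normal-approximation confidence interval for a mean with a known standard deviation; the function names, the standard deviation of 10 and the starting sample size of 50 are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def ci_half_width(sigma, n, confidence=0.95):
    """Half-width of a normal-approximation CI for a mean.

    Assumes the standard deviation `sigma` is known (illustrative simplification).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * sigma / math.sqrt(n)


def sample_size_multiplier(shrink_factor):
    """Factor by which n must grow to shrink the CI by `shrink_factor`."""
    return shrink_factor ** 2


sigma = 10.0   # hypothetical standard deviation
n_small = 50   # hypothetical starting sample size
n_large = n_small * sample_size_multiplier(3)

width_small = ci_half_width(sigma, n_small)
width_large = ci_half_width(sigma, n_large)
ratio = width_small / width_large

print(f"n = {n_small}: half-width = {width_small:.3f}")
print(f"n = {n_large}: half-width = {width_large:.3f}")
print(f"ratio of half-widths = {ratio:.3f}")
```

The ratio depends only on the two sample sizes, not on the assumed standard deviation or confidence level, which is why the rule of thumb can be stated without reference to either.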